Humanity’s Last Exam

https://gyazo.com/3f564decb774a86340ce84701c10b083

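The new benchmark, named "Humanity's Last Exam," evaluates whether AI systems have achieved world-class, expert-level reasoning and knowledge across a wide range of fields, including mathematics, the humanities, and the natural sciences. Throughout the fall, CAIS and Scale AI crowdsourced questions from experts to assemble the hardest and broadest problems they could to stump AI models. The exam was developed to address "benchmark saturation": models regularly achieve near-perfect scores on existing tests, yet may be unable to answer questions outside those tests. Saturation reduces a benchmark's usefulness as a precise measure of future model progress.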

2501.14249 Humanity's Last Exam

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE.
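
The abstract reports "low accuracy and calibration" without saying how either is computed. As a rough illustration only, the sketch below scores a set of graded answers for accuracy and for a binned RMS calibration error. The function names, the 10-bin scheme, and the sample data are assumptions made for this note and are not taken from the HLE release.

```python
from math import sqrt


def accuracy(correct):
    """Fraction of answers graded as correct."""
    return sum(correct) / len(correct)


def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned RMS gap between stated confidence and observed accuracy.

    Assumption: equal-width confidence bins, each weighted by its share
    of the questions. Other binning schemes are equally plausible.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    weighted_sq_gap = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        mean_acc = sum(ok for _, ok in items) / len(items)
        weighted_sq_gap += (len(items) / total) * (mean_conf - mean_acc) ** 2
    return sqrt(weighted_sq_gap)


# Hypothetical model: claims 90% confidence on every answer, gets 1 of 4 right.
confidences = [0.9, 0.9, 0.9, 0.9]
correct = [True, False, False, False]
print(accuracy(correct))
print(rms_calibration_error(confidences, correct))
```

A model that is both wrong and overconfident scores badly on both numbers, which is the pattern the abstract describes for current LLMs on HLE.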

https://gyazo.com/cfb79f9ea0351e01fe7a7661caef2e89

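My understanding: AI now solves existing problems so well that the differences between models no longer show up, so they made harder problems. 基素.icon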

#Scale_AI

#Center_for_AI_Safety